Abstract

The Brier score is frequently used by meteorologists to measure the skill of binary probabilistic 
forecasts. We show, however, that in simple idealised cases it gives counterintuitive results. We 
advocate the use of an alternative measure that has a more compelling intuitive justification. 

1 Introduction 

Users of meteorological forecasts need to be able to judge which forecasts are the best in order to decide 
which to use. We distinguish two cases. The first case is one in which the user plans to use the forecast for 
making a certain specific decision the details of which can be specified entirely in advance. The second is 
one in which the user plans to use the forecast for making one or more decisions which cannot be specified 
in detail in advance. 

In the first case it may be possible to decide which forecast is the best by analysing the effect of using different
forecasts on the quality of the final decisions made (for an example of this situation see Richardson
(2003)). In the second case, however, the user cannot convert forecasts into decisions ahead of time
because they do not know what decisions they are going to have to make. By the time they know what 
decision they are going to have to make, they do not have time to re-evaluate the available forecasts and 
potentially switch to a different forecast provider. In this second case forecasts have to be analysed and 
compared on their own merits, rather than on the merits of the decisions that can be based on them. In 
such a situation, the forecast user needs standard measures which can distinguish between forecasts at a 
general level^1. It is this second case that we will consider.

Forecasts can be divided into forecasts of the expectation of future outcomes and probabilistic forecasts 
that give probabilities of different outcomes. Probabilistic forecasts can then be divided into continuous 
and discrete probabilistic forecasts. A continuous probabilistic forecast gives a continuous density for the
distribution of possible outcomes. We have discussed how to measure the skill of such forecasts in Jewson,
and have applied the measures we propose to the calibration and the comparison of forecasts in
a number of studies such as Jewson et al. (2003) and Jewson (2003a).

Discrete probabilistic forecasts give probabilities for a number of discrete events. Any number of events 
can be considered, but in this article we will restrict ourselves to the case of only two events, which we 
call a binary probabilistic forecast. We will address the question of how binary probabilistic forecasts can 
be compared. 

One of the standard tools used by meteorologists to answer the question of which of two binary probabilistic
forecasts is the better is the Brier score, first used over 50 years ago (Brier, 1950) and still in use
today (see, for example, Vitart (2003), page 25). Nevertheless, we are going to argue that the Brier score
is flawed. This is not something that can be proven mathematically, of course. Our arguments will be 
based on an appeal to intuition: we will present a simple case in which we believe it is intuitively clear 
which of two forecasts is the better, and we will show that the Brier score then comes to the opposite 
conclusion to our intuition, i.e. it gives the wrong answer. We will then present an alternative score that
overcomes this problem, that has a definition that accords more clearly with intuition, and that is also 
more firmly grounded in standard statistical theory. 
^1 A good example is the root mean square error, which is a general measure used for comparing forecasts for the
expectation.



2 The Brier Score 



The Brier score for a binary event is defined as: 

$$b = \langle (f - o)^2 \rangle \qquad (1)$$

where $f$ is a forecast for the probability that an event X will happen, and $o$ is an observation which
takes the value 1 if the event happens and 0 otherwise. Lower values of the Brier score indicate better
forecasts. A detailed discussion is given in Toth et al. (2003).
We can expand the Brier score as: 

$$b = \langle f^2 \rangle - 2 \langle f o \rangle + \langle o^2 \rangle \qquad (2)$$

When we are comparing two forecasts on the same observed data set the difference in the Brier score is 
given by: 

$$b_2 - b_1 = \langle f_2^2 \rangle - 2 \langle f_2 o \rangle - \langle f_1^2 \rangle + 2 \langle f_1 o \rangle \qquad (3)$$

where the $\langle o^2 \rangle$ term has cancelled because it is the same for both forecasts. If this difference is positive
($b_2 > b_1$) then we conclude that $f_1$ is the better forecast.

A particularly simple case is where the forecast probabilities have constant values, giving: 

$$b_2 - b_1 = f_2^2 - 2 f_2 \langle o \rangle - f_1^2 + 2 f_1 \langle o \rangle \qquad (4)$$

A further simplification is possible if the event occurs with a constant probability p, in which case 
$\langle o \rangle = p$ and

$$b_2 - b_1 = f_2^2 - 2 f_2 p - f_1^2 + 2 f_1 p \qquad (5)$$
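As a short worked step (our own rearrangement, not a formula quoted from the references), the expected Brier score of a constant forecast can be written in a form that makes its behaviour explicit. Because $o$ only takes the values 0 and 1, $o^2 = o$ and so $\langle o^2 \rangle = p$:

```latex
\begin{align*}
b &= f^2 - 2 f p + p = (f - p)^2 + p(1 - p), \\
b_2 - b_1 &= (f_2 - p)^2 - (f_1 - p)^2 .
\end{align*}
```

Expanding the squares recovers equation (5) term by term. In this constant case the Brier score therefore prefers whichever forecast is closer to $p$ in absolute terms, however close to 0 or 1 that forecast may be.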



3 A simple example 

We now consider a very simple example, with constant event probability and constant forecast probabilities.
We set $p = 1/10$, and consider the forecasts $f_1 = 0$ and $f_2$ = [...].
In this case the difference between the Brier scores is given by: 

$$b_2 - b_1 = f_2^2 - 2 f_2 p - f_1^2 + 2 f_1 p \qquad (6)$$
Since this difference is positive, the Brier score leads us to conclude that forecast $f_1$ is the better forecast. However, this does not agree
with our intuition. Forecast $f_1$ is a disaster: it predicts a zero probability (a very strong statement!)
for something that happens not infrequently. Forecast $f_1$ is completely invalidated whenever event X
actually occurs (on average, 1 in every 10 trials). Forecast $f_2$, on the other hand, is not so bad. It gives
a lowish probability for something that does indeed occur with a low probability. Its only fault is that
the probability is not exactly correct.
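The comparison can be checked numerically with a minimal sketch. The function name `expected_brier` is ours, and the value $f_2 = 1/4$ is an assumption chosen purely for illustration; with $f_1 = 0$, any $f_2$ larger than $2p = 1/5$ gives a difference of the same sign. Exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction


def expected_brier(f, p):
    """Expected Brier score <(f - o)^2> of a constant forecast f
    when the event occurs with constant probability p."""
    # o = 1 with probability p, o = 0 with probability 1 - p
    return p * (f - 1) ** 2 + (1 - p) * f ** 2


p = Fraction(1, 10)    # event probability from the example
f1 = Fraction(0)       # forecast 1: zero probability
f2 = Fraction(1, 4)    # forecast 2: ASSUMED illustrative value

b1 = expected_brier(f1, p)
b2 = expected_brier(f2, p)

# Lower is better, so a positive difference means the Brier score prefers f1.
print(b1, b2, b2 - b1)
```

Under these assumed values the difference $b_2 - b_1$ is positive, so the Brier score ranks the zero-probability forecast first.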

The reason that the Brier score makes this mistake is that it does not penalise forecasts that predict 
a zero probability strongly enough when they are wrong, even though our intuition tells us that they 
should be heavily penalised. More generally, the Brier score does not penalise forecasts that give very 
small probabilities when they should be giving larger probabilities to the same extent that we penalise 
such forecasts with our intuition. This is because the Brier score is based on a straight difference between 
$f$ and $o$. Our intuition, on the other hand, considers the difference between probabilities of 0% and 10%
to be very different from the difference between probabilities of 40% and 50%. Intuition apparently uses 
fractional or logarithmic rather than absolute differences in probability. 

One can easily construct other examples that illustrate this point. The more extreme the events considered,
the more striking is the problem with the Brier score. Consider, for example, $p$ = [...], $f_1 = 0$ and
$f_2$ = [...]. Again the Brier score prefers $f_1$, while our intuition considers $f_1$ to be a failure, and $f_2$ to be
a reasonably good attempt at estimating a very small probability.

We conclude that the Brier score cannot be trusted to make the right decision about which of two forecasts 
is better. It should also not be used to calibrate forecasts or evaluate forecasting systems since it will 
over-encourage prediction of very small or zero probabilities. We need a different measure. 



4 The likelihood score 



The standard measure used in classical statistics for testing which of two distributions gives the best fit 
to data is the likelihood $L$, defined as the probability (or probability density) of the observations given
the model and the parameters of the model (Fisher, 1922). In our case this becomes the probability of
the observations given the forecast. 

We advocate the likelihood as the best metric for calibrating and comparing continuous probabilistic 
forecasts (see the previous citations) mainly on the basis that it is very intuitively reasonable: the 
forecast that gives the highest probability for the observations is the better forecast. We also advocate 
the likelihood as the best metric for calibrating and comparing binary forecasts. In this case the likelihood 
is given by: 

$$L = p(x \mid f) \qquad (7)$$

where $x$ is the full set of observations and $f$ is the full set of forecasts. If we assume that the forecast
errors are independent in time then this becomes: 

$$L = \prod_{i=1}^{n} p(x_i \mid f_i) = \prod_{i=1}^{n} \left[ o_i f_i + (1 - o_i)(1 - f_i) \right] \qquad (8)$$

We can also use the log-likelihood, which gives a more compressed range of values, and is given by: 

$$l = \ln L = \sum_{i=1}^{n} \ln \left[ o_i f_i + (1 - o_i)(1 - f_i) \right] \qquad (9)$$
If we put all cases of event X occurring into set A, and all cases of event X not occurring into set B, then:

$$L = \prod_{A} f_i \prod_{B} (1 - f_i) \qquad (10)$$

and 

$$l = \sum_{A} \ln f_i + \sum_{B} \ln (1 - f_i) \qquad (11)$$

If we now consider the special case in which $f$ is constant then:

$$L = f^a (1 - f)^b \qquad (12)$$

and 

$$l = a \ln f + b \ln (1 - f) \qquad (13)$$

where $a$ is the number of occurrences of X, $b$ is the number of occurrences of not X, and $b = n - a$.
If any of the predictions $f$ is 0 or 1 (i.e. is completely certain) and turns out to be wrong, then $L = 0$ and $l = -\infty$. If not, then
$L > 0$ and $l > -\infty$. We see that use of the likelihood penalises the use of probability forecasts with
values of 0 or 1 very heavily. As soon as one of them is wrong, such forecasts get the worst possible score, as they should (since one can
never be completely certain).
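The behaviour of the log-likelihood at the extremes can be illustrated with a short sketch of equation (9). The function name `log_likelihood`, the four-trial sample and the hedged forecast value 0.25 are all hypothetical choices made for this illustration. A zero term is returned as `-math.inf` explicitly, because `math.log(0)` raises an error in Python rather than returning minus infinity.

```python
import math


def log_likelihood(forecasts, observations):
    """Log-likelihood of binary observations (0 or 1) given forecast
    probabilities, assuming independent forecast errors."""
    total = 0.0
    for f, o in zip(forecasts, observations):
        # Probability the forecast assigned to what actually happened
        prob = o * f + (1 - o) * (1 - f)
        if prob == 0:
            # A completely certain forecast that turned out to be wrong
            return -math.inf
        total += math.log(prob)
    return total


obs = [0, 0, 1, 0]                                  # hypothetical sample
ll_certain = log_likelihood([0.0] * 4, obs)         # always forecasts 0
ll_hedged = log_likelihood([0.25] * 4, obs)         # assumed value 0.25
ll_lucky = log_likelihood([0.0] * 2, [0, 0])        # certain, never wrong

print(ll_certain, ll_hedged, ll_lucky)
```

The certain forecast scores minus infinity as soon as the event occurs once, while the same forecast on a sample in which the event never occurs is not penalised at all.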

In our simple example $f_1 = 0$, so $L_1 = 0$ for any sample in which X occurs at least once, and the difference in likelihoods for the two forecasts is:

$$L_2 - L_1 = f_2^a (1 - f_2)^b \qquad (14)$$

Since this is positive for all such samples we see that the likelihood concludes that forecast 2 is better, in line
with our intuition.
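The two scores can be put side by side on a single sample. The sketch below is an illustration under stated assumptions: the sample (X occurring once in ten trials) and the value $f_2 = 0.25$ are hypothetical, and the function names are ours. The likelihood uses the constant-forecast form of equation (12).

```python
def sample_brier(f, a, b):
    """Mean Brier score of a constant forecast f over a occurrences
    and b non-occurrences of the event."""
    return (a * (f - 1) ** 2 + b * f ** 2) / (a + b)


def sample_likelihood(f, a, b):
    """Likelihood f^a (1 - f)^b of a constant forecast f."""
    return f ** a * (1 - f) ** b


a, b = 1, 9           # hypothetical sample: X occurs once in ten trials
f1, f2 = 0.0, 0.25    # f2 = 0.25 is an ASSUMED illustrative value

# Brier score: lower is better. Likelihood: higher is better.
brier_prefers = "f1" if sample_brier(f1, a, b) < sample_brier(f2, a, b) else "f2"
likelihood_prefers = (
    "f1" if sample_likelihood(f1, a, b) > sample_likelihood(f2, a, b) else "f2"
)

print(brier_prefers, likelihood_prefers)  # prints "f1 f2"
```

On this sample the two measures disagree: the Brier score ranks the zero-probability forecast first, while the likelihood ranks it last.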



5 Summary 

Meteorologists have used the Brier score to compare binary probabilistic forecasts for over 50 years. 
However, we find that in simple cases it makes the wrong decision as to which is the better of two 
forecasts (where we define wrong in terms of our intuition). We reach this conclusion independently of 
any detailed analysis of the preferences of the user of the forecast. 

We advocate scores based on the likelihood as a replacement for the Brier score. On the one hand the 
likelihood is conceptually simpler than the Brier score: it decides which forecast is better simply according
to which forecast gives the higher probability for the observed data, which seems immediately reasonable.
On the other hand the likelihood accords with our intuition in the simple example that we present, and
punishes forecasts that give probabilities of 0 and 1 appropriately.

We conclude that use of the Brier score should be discontinued, and should be replaced by a score based 
on the likelihood. 